<html>
<head>
<title>Standard Discrimination Handling Scenarios</title>
</head>
<body>

<h1>Standard Discrimination Handling Scenarios</h1>

<p align=center><em>Date: 2009-06-17; for BOXER ver. 0.6.004</em></p>

<p>This document lists a number of major patterns of use of BOXER
("use cases"). For each, it briefly describes the sequence of actions an
API user can take, including when and how many times parsing is done,
whether it is applied to single examples or to a batch of them at each
parsing step, whether parsing is definitional or not, etc.</p>

<h2>Notes</h2> 


<p>Most examples in this document assume that the XML input is well-formed, and
that it is consistent with BOXER's XML syntax and semantics. If that is
not the case, an <tt>org.xml.sax.SAXException</tt>
or <a href="api/edu/dimacs/mms/boxer/BoxerXMLException.html">BoxerXMLException</a>,
respectively, will be thrown by BOXER. If desired, these exceptions can
be caught by the user's code using the usual Java try/catch syntax.
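<p>As a minimal sketch of the try/catch pattern mentioned above, the
following class uses only the JDK's own XML parser, so that it runs without
BOXER on the classpath. The class name <tt>SafeXmlRead</tt> and the
method <tt>parseOrNull</tt> are made up for this illustration and are not
part of the BOXER API. When calling BOXER's own parsing methods, a
BoxerXMLException could presumably be handled with a further catch clause of the same kind.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of BOXER: parses XML text with the JDK parser.
class SafeXmlRead {
    /** Returns the root element, or null if the text cannot be parsed. */
    static Element parseOrNull(String xml) {
        try {
            DocumentBuilder db =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)))
                     .getDocumentElement();
        } catch (SAXException ex) {
            // Thrown when the input is not well-formed XML
            System.err.println("Malformed XML: " + ex.getMessage());
            return null;
        } catch (IOException | ParserConfigurationException ex) {
            // I/O trouble, or no usable parser could be configured
            System.err.println("Could not read XML: " + ex);
            return null;
        }
    }
}
```

<p>Returning <tt>null</tt> keeps the sketch short; real code may prefer to
log the problem and rethrow, so that the caller can decide whether to stop.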

<p>
In the examples in this document that handle a series of XML datapoint
definitions, we will assume, for simplicity, that the
examples have been supplied in an array <tt>org.w3c.dom.Element[]
elements</tt>, where each element has the type <tt>datapoint</tt>. In
practice, of course, it will be more common to read examples from a
single XML file (containing a single <tt>dataset</tt> element), and
process datapoint elements as they are encountered. Thus, the more
likely overall framework for such processing may look like this:
<pre>
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.util.Vector;

  ....
  // Set up a logger for your error messages (or you can just use stderr...)
  Logger logger = Logger.getLogger("MyLog"); 

  ...
  // read the top-level "dataset" element, then process its "datapoint" children one by one
  org.w3c.dom.Element e = ParseXML.readFileToElement("dataset.xml"); 
  Vector&lt;Element&gt; dpElements = new Vector&lt;Element&gt;();

  int cnt = 0;
  boolean error = false;
  for(Node n = e.getFirstChild(); n!=null; n = n.getNextSibling()) {
    int type = n.getNodeType();
    String val = n.getNodeValue();
	    
    if (type == Node.ELEMENT_NODE && 
        n.getNodeName().equals(ParseXML.NODE.DATAPOINT)) {
       // parse a data point with its labels against the "validator" suite
       try {
          Element pe = (Element)n;  // the "datapoint" element 
          dpElements.add(pe);  // save the "datapoint" element, if desired
          .... <em>process the datapoint element here</em> ...
          cnt ++; // keep going
       } catch (Exception ex) {  // named "ex" so as not to clash with the Element e above
          logger.error("Error when validating the " + (cnt+1) + "-th example: " + ex);
           error = true;
           break; // stop processing
       }
    } else {
       // extraneous stuff found in XML - may report parsing warning, if desired
    }
  } // end of the loop over the children of the "dataset" element

   // This is the array of "data point" elements which, 
   // in individual scenarios, we will assume already exists:
   Element elements[]= dpElements.toArray(new Element[0]);
</pre>
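<p>The loop above depends on BOXER's <tt>ParseXML</tt> helper. For readers
who want to try the traversal in isolation, here is a self-contained sketch
that uses only JDK classes. The class name <tt>DatapointCollector</tt>, its
methods, and the sample XML are assumptions made for this illustration. The
literal <tt>"datapoint"</tt> is assumed to match the element name that BOXER
refers to as <tt>ParseXML.NODE.DATAPOINT</tt>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of BOXER.
class DatapointCollector {
    /** Collects the direct "datapoint" children of a "dataset" element. */
    static Element[] collect(Element dataset) {
        List<Element> found = new ArrayList<>();
        for (Node n = dataset.getFirstChild(); n != null; n = n.getNextSibling()) {
            // Whitespace text nodes and elements with other names are skipped
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && n.getNodeName().equals("datapoint")) {
                found.add((Element) n);
            }
        }
        return found.toArray(new Element[0]);
    }

    /** Parses XML text with the JDK parser; stands in for reading a file. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("Cannot parse XML", ex);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data, used only to exercise the loop
        String xml = "<dataset>\n"
                + "  <datapoint name=\"a\"/>\n"
                + "  <other/>\n"
                + "  <datapoint name=\"b\"/>\n"
                + "</dataset>";
        Element[] elements = collect(parse(xml));
        System.out.println(elements.length + " datapoint elements found");
    }
}
```

<p>The resulting array plays the role of the <tt>elements</tt> array assumed
in the individual scenarios.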

<h2>1. Validation Scenarios</h2>

<ul>
<li><a href="scenario-1p1a.html">Scenario 1.1a</a> - finding all problematic entries in a set of examples
  
<li>Scenario 1.1b - <em>To appear...</em>

<!--

Scenario: User wants to check the validity of a set of XML
examples against Suite Foo. They care in a yes/no way whether the
entire set of examples is valid. If the set is not valid they would
like as much information as possible about each invalid example in the
set.  No further use of the examples happens at that time. Suite Foo
should not be changed by this process.

<P>In cases where the only invalidity is that we run out of
anonymous classes in a DCS2/Bounded discrimination before running out
of examples, the identified "invalid"  examples are dependent
on the order in which examples are processed. Other types of
invalidity, however, are not dependent on order. So it seems
desirable to have an option to issue warning-style exceptions and keep
processing.

<p><strong>Implementation</strong>

<p>
<em>Solution (a).</em> If the entire set of examples is contained in a
single "dataset" XML element (typically, the top level element of a
dataset XML file), <strong>and</strong> the user only wants to learn
about the first invalid example (without parsing the rest of the set),
one can validate it as follows:

<pre>
  import boxer.*;

  ...
  Suite suite = new Suite("foo.xml"); // create suite...
  // modify suite by any preceding definitional parsing etc
  ....                                
  // validate the set of examples from dataset XML
  // a "datapoint" element
  String f ="dataset.xml"; 
  org.w3c.dom.Element e = ParseXML.readFileToElement(f); 
  int n=ParseXML.validateDatasetElement(e, suite, true);
  if (n &gt; 0) {
	System.out.println("[VALIDATE] The data set from file " + f + 
        " appears to be fully acceptable as a training set in the current "+
        "suite. It contains " + n + " data points");
  } else {
	 System.out.println("[VALIDATE] It would not be possible to parse "+
         "the data set from file " + f + 
         " as a training set in the current suite. Please see a warning "+
         "message in the log for details");
  }

</pre>	

<p> In the example above, method
<tt>ParseXML.validateDatasetElement(e, suite, true)</tt> creates
(internally) a "light-weight copy" of the suite, and tries to carry
out "definitional parsing" of the <tt>dataset</tt> element <tt>e</tt>
against that copy (i.e., modifying it - adding classes etc - as it
goes, whenever necessary and appropriate). If the process fails (e.g.,
trying to add a class to a DCS1 (=Fixed) discrimination), an exception
is caught within this method, and it returns -1.

<p><em>Solution (b).</em> To validate <em>all</em> examples - even the
ones that follow the first problematic one - one can use the method
<tt>ParseXML.validateDatasetElement2(e, suite, true)</tt>, as follows:


<pre>
  import boxer.*;

  ...
  Suite suite = new Suite("foo.xml"); // create suite...
  // modify suite by any preceding definitional parsing etc
  ....                                
  // validate the set of examples from dataset XML
  // a "datapoint" element
  String f ="dataset.xml"; 
  org.w3c.dom.Element e = ParseXML.readFileToElement(f); 
  Vector&lt;Object&gt; v=ParseXML.validateDatasetElement2(e, suite, true);

  // scan the list of the results
  int nGood = 0, nBad = 0;
  for(int i=0; i &lt; v.size(); i++) {
     Object o = v.elementAt(i);
     if (o instanceof DataPoint) {
        nGood ++;
     } else if (o instanceof Exception) {
        nBad ++;
        System.out.println("Example No. " + i + " is problematic: " + o);
     }
  }
    
  if (nBad==0) {
	System.out.println("[VALIDATE] The data set from file " + f + 
        " appears to be fully acceptable as a training set in the current "+
        "suite. It contains " + nGood + " data points");
  } else {
	 System.out.println("[VALIDATE] It would not be possible to parse "+
         "the data set from file " + f + 
         " as a training set in the current suite. Found " + nGood + 
         " acceptable examples, and " + nBad + " unacceptable ones. " +
         "See messages above for details");
  }
</pre>	


<p><em>Solution (c).</em> If you need to validate a sequence of
examples (e.g., spread over 3 dataset elements), but, again, only want to learn about the first invalid example in the sequence, you will have to
explicitly carry out the same process
that <tt>ParseXML.validateDatasetElement(...)</tt>  does internally,
viz.:
<pre>
  Suite suite = new Suite("foo.xml"); // create suite...
  // modify suite by any preceding definitional parsing etc
  ....                                
  // validate the set of examples from dataset XML
 
  // create a lightweight copy of the suite (so that any modifications
  // that may happen during validation will only affect this copy, and not
  // your "main" suite
  Suite validator = suite.lightweightCopyOf("Validator");
 
  String filenames[] = { "dataset1.xml", "dataset2.xml", ...};
  for( String f : filenames) {
    try {
      Element e=ParseXML.readFileToElement(f);  // read the current file of the loop
      Vector&lt;DataPoint&gt; v = parseDatasetElement(e, validator, isDefinitional);
       int  n = v.size();
       System.out.println("All " + n + " examples from file " + f + 
       " can be definitionally parsed");
    } catch (Exception ex) {
	    Logging.warning("It would not be possible to parse the entire data set "+
            f+" in the context of suite '"+suite.getName()+
            "' , because of the following problem: "  + ex);
    }
  }
</pre>

<p>Note that, in all examples above, only the suite <tt>validator</tt>
(the "lightweight copy" of the "main suite" <tt>suite</tt>) is
modified. Therefore, once the validation has completed, you still need
to parse the dataset against the main suite.
-->

<li><a href="scenario-1p2.html">Scenario 1.2</a> - parsing XML examples until the first problematic one

</ul>

<h2>2. Validation &amp; Definition</h2>

<ul>
<li><a href="scenario-2p1.html">Scenario 2.1</a> - validate a set of examples and, if all of them are acceptable, parse them in the context of one's Suite, updating Discrimination definitions as required

<li><a href="scenario-2p2.html">Scenario 2.2</a> - process examples, updating the Discrimination definitions as required, until the first unacceptable example is encountered
</ul>

<h2>3. Validation &amp; Definition &amp; Training</h2>

<ul>
<li><a href="scenario-3p1.html">Scenario 3.1</a> - validate all examples, and if all are acceptable, parse them in the context of the main Suite, and train the learner(s) on them

<li><a href="scenario-3p2.html">Scenario 3.2</a> - validate all examples; parse all valid examples in the context of the main Suite, and train the learner(s) on them, ignoring the invalid ones
</ul>

<h2>4. Validation &amp; Definition &amp; Model Application</h2>

<ul>
<li><a href="scenario-4p1.html">Scenario 4.1</a> - if the entire data
set is acceptable, parse each example in the context of the main Suite
(<em>updating</em> Discrimination definitions as one goes, as if these
were training examples), and apply the learner's model to each
example.

<li><a href="scenario-4p2.html">Scenario 4.2</a> - until the first 
unacceptable example is encountered, parse each example in the context
of the main Suite (<em>updating</em> Discrimination definitions as one
goes, as if these were training examples), and apply the learner's
model to each example.

</ul>

<h2>5. Validation &amp; Model Application Only [ANOMALOUS?]</h2>

<ul>
<li><a href="scenario-5p1.html">Scenario 5.1</a> - if all examples are
acceptable, apply the current learner's model to each example, without
updating Discrimination definitions.

<li><a href="scenario-5p2.html">Scenario 5.2</a> - process examples
one at a time, until the first unacceptable example is found. Until
then, apply the current learner's model to each example, without
updating Discrimination definitions.
</ul>


<h2>
6. Validation Plus Classical Online Learning (&quot;Test, then Train&quot;)
</h2>

<ul>
<li><a href="scenario-6p1.html">Scenario 6.1</a> - if the entire
dataset is acceptable, parse each example in the context of the main
Suite (updating Discrimination definitions), apply the current
learner's model to it, and then train the learner on it.

<li><a href="scenario-6p2.html">Scenario 6.2</a> - process the entire
dataset sequentially, skipping unacceptable examples. For each
acceptable example, parse it in the context of the main Suite
(updating Discrimination definitions), apply the current learner's
model to it, and then train the learner on it.
</ul>
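<p>The order of operations in "test, then train" can be sketched with a toy
learner. Nothing in this block is BOXER API: the
class <tt>TestThenTrain</tt>, the nested <tt>MajorityLearner</tt>, and the
convention that a <tt>null</tt> label marks an unacceptable example are all
assumptions made for illustration. The point is only that each acceptable
example is scored by the model as it currently stands, and is used for
training afterwards.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of BOXER.
class TestThenTrain {
    /** Toy stand-in for a learner: predicts the label seen most often so far. */
    static class MajorityLearner {
        private final Map<String, Integer> counts = new HashMap<>();

        String predict() {
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> en : counts.entrySet()) {
                if (en.getValue() > bestCount) {
                    best = en.getKey();
                    bestCount = en.getValue();
                }
            }
            return best; // null until at least one example has been seen
        }

        void train(String label) {
            counts.merge(label, 1, Integer::sum);
        }
    }

    /** Returns how many predictions were correct; null labels are skipped. */
    static int run(String[] labels) {
        MajorityLearner learner = new MajorityLearner();
        int correct = 0;
        for (String label : labels) {
            if (label == null) {
                continue;                      // as in Scenario 6.2: skip unacceptable examples
            }
            String guess = learner.predict();  // test first, with the current model
            if (label.equals(guess)) {
                correct++;
            }
            learner.train(label);              // then train on the same example
        }
        return correct;
    }
}
```

<p>Because prediction precedes training, the accuracy measured this way
never benefits from an example the model has already seen.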

<h2>7. Validation Plus TREC Filtering Style Online Learning</h2>

<ul>
<li><a href="scenario-7p1.html">Scenario 7.1</a> - if the entire
dataset is acceptable, process each example, updating
Discrimination definitions (as if this were a training example) and
applying the learner's model to the example. If the predictions made
by the model are deemed appropriate, train the learner on this example.

<li><a href="scenario-7p2.html">Scenario 7.2</a> - process the entire
dataset sequentially, skipping unacceptable examples. For each
acceptable example, update
Discrimination definitions (as if this were a training example) and
apply the learner's model to the example. If the predictions made
by the model are deemed appropriate, train the learner on this example.

</ul>
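<p>The filtering-style loop differs from "test, then train" in that training
is conditional on the prediction. A minimal sketch follows, again with a toy
learner rather than BOXER classes. The names <tt>FilteringLoop</tt>
and <tt>MeanLearner</tt>, the use of <tt>NaN</tt> to mark an unacceptable
example, and the caller-supplied <tt>appropriate</tt> predicate are all
assumptions for illustration. What counts as an "appropriate" prediction is
left to the application.

```java
import java.util.function.DoublePredicate;

// Hypothetical illustration, not part of BOXER.
class FilteringLoop {
    /** Toy learner: scores a value by its distance above the mean seen so far. */
    static class MeanLearner {
        private double sum = 0.0;
        private int n = 0;

        double score(double x) {
            return n == 0 ? 0.0 : x - sum / n;
        }

        void train(double x) {
            sum += x;
            n++;
        }

        int trainedOn() {
            return n;
        }
    }

    /** Returns the number of examples the learner was actually trained on. */
    static int run(double[] examples, DoublePredicate appropriate) {
        MeanLearner learner = new MeanLearner();
        for (double x : examples) {
            if (Double.isNaN(x)) {
                continue;                     // as in Scenario 7.2: skip unacceptable examples
            }
            double s = learner.score(x);      // apply the model first
            if (appropriate.test(s)) {
                learner.train(x);             // train only if the prediction passes the check
            }
        }
        return learner.trainedOn();
    }
}
```

<p>For instance, <tt>FilteringLoop.run(data, s -&gt; s &gt;= 0.0)</tt> trains
only on examples that score at or above the running mean.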

<h2>See also</h2>
<ul>
<li><a href="boxer-user-guide.html">BOXER User Guide</a>
<li><a href="tags.html">Overview of the XML elements used by BOXER</a>
<li><a href="nd.html">Treatment of new discrimination/class labels when parsing a data set</a>
</ul>


</body>
</html>
